Abstract


Estimating mutual information (MI) from samples is a fundamental problem in
statistics, machine learning, and data analysis. Recently it was shown that a
popular class of non-parametric MI estimators performs very poorly for strongly
dependent variables and has sample complexity that scales exponentially with
the true MI. This undesired behavior was attributed to the reliance of those
estimators on local uniformity of the underlying (and unknown) probability
density function. Here we present a novel semi-parametric estimator of mutual
information, where at each sample point, densities are locally approximated by
a Gaussian distribution. We demonstrate that the estimator is asymptotically
unbiased. We also show that the proposed estimator has superior performance
compared to several baselines, and is able to accurately measure relationship
strengths over many orders of magnitude.


1 Introduction 

Mutual information (MI) is a fundamental measure of dependence between two
random variables. While it initially arose in the theory of communication as a
natural measure of the ability to communicate over noisy channels (Shannon,
1948), mutual information has since been used in different disciplines such as
machine learning, information retrieval, neuroscience, and computational
biology, to name a few. This widespread use is due in part to the generality of
the measure, which allows it to characterize dependency strength for both
linear and non-linear relationships between arbitrary random variables.

Let us consider the following basic problem: given a set of i.i.d. samples from
an unknown, absolutely continuous joint distribution, our goal is to estimate
the mutual information from these samples. A naive method would be first to
learn the underlying probability distribution using either parametric or
non-parametric methods, and then calculate the mutual information from the
obtained distribution. Unfortunately, this naive approach often fails, as it
requires a very large number of samples, especially in high dimensions. A
different approach is to estimate mutual information directly from samples.
For instance, rather than estimating the whole probability distribution, one
could estimate the density (and its marginals) only at each sample point, and
then plug those estimates into the expression for mutual information. Direct
estimators of this type have been shown to be a more feasible method for
estimating MI in higher dimensions. An important and very popular class of such
estimators is based on k-nearest-neighbor (kNN) graphs and their
generalizations (Singh et al., 2003; Kraskov et al., 2004; Pal et al., 2010).

Despite the widespread popularity of the direct estimators, it was recently
demonstrated that those methods fail to accurately estimate mutual information
for strongly dependent variables (Gao et al., 2015). Specifically, it was shown
that accurate estimation of mutual information between two strongly dependent
variables requires a number of samples that scales exponentially with the true
mutual information. This undesired behavior was attributed to the assumption of
local uniformity of the underlying distribution postulated by those estimators.
To address this shortcoming, Gao et al. (2015) proposed to add a correction
term to compensate for non-uniformity, based on local PCA-induced
neighborhoods. Although intuitive, the resulting estimator relied on a
heuristically tuned threshold parameter and had no theoretical performance
guarantees (Gao et al., 2015).

Our main contribution is to propose a novel mutual information estimator based
on local Gaussian approximation, with provable performance guarantees, and
superior empirical performance compared to existing estimators over a wide
range of relationship strengths. Instead of assuming a uniform distribution in
the local neighborhood, our new estimator assumes a Gaussian distribution
locally around each point. The new estimator leverages previous results on
local likelihood density estimation (Hjort and Jones, 1996; Loader, 1996). As
our main theoretical result, we demonstrate that the new estimator is
asymptotically unbiased. We also demonstrate that the proposed estimator
performs as well as existing baseline estimators for weak relationships, but
outperforms all of those estimators for stronger relationships.


The paper is organized as follows. In the next section, we review the basic
definitions of information-theoretic concepts such as mutual information and
formally define our problem. In Section 3, we review the limitations of current
mutual information estimators as pointed out in (Gao et al., 2015). Section 4
introduces local likelihood density estimation. In Section 5, we use this
density estimator to propose novel entropy and mutual information estimators,
and summarize certain theoretical properties of those estimators, which are
then proved in Section 6. Section 7 provides numerical experiments
demonstrating the superiority of the proposed estimator. We conclude the paper
with a brief survey of related work followed by a discussion of our main
results and some open problems.


2 Formal Problem Definition 

In this section we briefly review the formal definition of 
Shannon entropy and mutual information, before formally 
defining the objective of our paper. 

Definition 1 Let $\mathbf{x}$ denote a d-dimensional absolutely continuous
random variable with probability density function
$f : \mathbb{R}^d \to \mathbb{R}$. The Shannon differential entropy is defined
as

$$H(\mathbf{x}) = -\int_{\mathbb{R}^d} f(\mathbf{x}) \log f(\mathbf{x})\, d\mathbf{x} \tag{1}$$


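Definition 2 Let $\mathbf{x}$ and $\mathbf{y}$ denote d-dimensional and
b-dimensional absolutely continuous random variables with probability density
functions $f_X : \mathbb{R}^d \to \mathbb{R}$ and
$f_Y : \mathbb{R}^b \to \mathbb{R}$, respectively. Let $f_{XY}$ denote the
joint probability density function of $\mathbf{x}$ and $\mathbf{y}$. The mutual
information between $\mathbf{x}$ and $\mathbf{y}$ is defined as

$$I(\mathbf{x} : \mathbf{y}) = \int_{\mathbf{y} \in \mathbb{R}^b} \int_{\mathbf{x} \in \mathbb{R}^d} f_{XY}(\mathbf{x}, \mathbf{y}) \log \frac{f_{XY}(\mathbf{x}, \mathbf{y})}{f_X(\mathbf{x})\, f_Y(\mathbf{y})}\, d\mathbf{x}\, d\mathbf{y} \tag{2}$$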


It is easy to show that

$$I(\mathbf{x} : \mathbf{y}) = H(\mathbf{x}) + H(\mathbf{y}) - H(\mathbf{x}, \mathbf{y}), \tag{3}$$

where $H(\mathbf{x}, \mathbf{y})$ stands for the joint entropy of
$(\mathbf{x}, \mathbf{y})$, and can be calculated from Eq. 1 using the joint
density $f_{XY}$. We use natural logarithms so that information is measured in
nats.
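As a sanity check on Eq. 3, the identity can be verified in a case where every entropy is known in closed form. The sketch below assumes a bivariate Gaussian with unit variances and correlation rho, which is an illustrative choice and not one of the paper's experiments; the helper names `gaussian_entropy` and `gaussian_mi` are hypothetical. It relies on the standard formula $H = \frac{1}{2}\log\big((2\pi e)^d \det\Sigma\big)$ for a d-dimensional Gaussian, so the result should agree with $-\frac{1}{2}\log(1-\rho^2)$ nats.

```python
import numpy as np


def gaussian_entropy(cov):
    """Differential entropy (nats) of a Gaussian with covariance `cov`."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)


def gaussian_mi(rho):
    """MI of a unit-variance bivariate Gaussian, computed through Eq. 3."""
    joint_cov = np.array([[1.0, rho], [rho, 1.0]])
    return gaussian_entropy(1.0) + gaussian_entropy(1.0) - gaussian_entropy(joint_cov)


rho = 0.9
via_eq3 = gaussian_mi(rho)
closed_form = -0.5 * np.log(1.0 - rho**2)
print(via_eq3, closed_form)
```

Letting rho approach 1 makes the mutual information grow without bound, which is the strongly dependent regime the paper is concerned with.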


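It is sometimes useful to represent entropy and mutual information as the
following expectations:

$$H(\mathbf{x}) = E_X\left[-\log f(\mathbf{x})\right] \tag{4}$$

$$I(\mathbf{x} : \mathbf{y}) = E_{XY}\left[\log \frac{f_{XY}(\mathbf{x}, \mathbf{y})}{f_X(\mathbf{x})\, f_Y(\mathbf{y})}\right] \tag{5}$$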
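The expectation form in Eq. 4 suggests replacing the expectation by a sample average. The sketch below illustrates only that step, under the assumption that the true density is known (here a standard normal, chosen purely for illustration); in the estimation problem itself the density is unknown and must be estimated. The function names are hypothetical.

```python
import numpy as np


def resubstitution_entropy(samples, log_density):
    """Sample-average version of Eq. 4: -(1/N) * sum_i log f(x_i)."""
    return -np.mean(log_density(samples))


def std_normal_logpdf(x):
    """Log-density of the one-dimensional standard normal."""
    return -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)


rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)

estimate = resubstitution_entropy(samples, std_normal_logpdf)
exact = 0.5 * np.log(2.0 * np.pi * np.e)  # entropy of N(0, 1) in nats
print(estimate, exact)
```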


Assume now we are given N i.i.d. samples
$(\mathcal{X}, \mathcal{Y}) = \{(\mathbf{x}, \mathbf{y})^{(i)}\}_{i=1}^{N}$
from the unknown joint distribution $f_{XY}$. Our goal is then to construct a
mutual information estimator $\hat{I}(\mathbf{x} : \mathbf{y})$ based on those
samples.


3 Limitations of Nonparametric MI 
Estimators 


As pointed out in Section 1, one of the most popular classes of mutual
information estimators is based on k-nearest-neighbor (kNN) graphs and their
generalizations (Singh et al., 2003; Kraskov et al., 2004; Pal et al., 2010).
However, it was recently shown that for strongly dependent variables, those
estimators tend to underestimate the mutual information (Gao et al., 2015). To
understand this problem, let us focus on the kNN-based estimator as an example.
The kNN estimator assumes uniform density within the kNN rectangle (containing
the k nearest neighbors), as shown in Figure 1(a). Generally speaking, this
assumption can be made valid for any relationship as long as we have a
sufficient number of samples. However, for a limited sample size, this
assumption becomes problematic when the relationship between the two variables
becomes sufficiently strong. In fact, as shown in Figure 1(b), the local
neighborhood induced by kNN extends beyond the support of the probability
distribution (shaded area).

This undesired behavior is closely related to the so-called boundary effect
that occurs in nonparametric density estimation. Namely, for strongly dependent
random variables, almost all the sample points are close to the boundary of the
support (as illustrated in Figure 1(b)), making the density estimation problem
difficult.

To relax the local uniformity assumption in kNN-based estimators, Gao et al.
(2015) proposed to replace the axis-aligned rectangle with a PCA-aligned
rectangle locally, and use the volume of this rectangle for estimating the
unknown density at a given point. Mathematically, this revision was implemented
by introducing a novel term that accounted for local non-uniformity. It was
shown that the revised estimator significantly outperformed the existing
estimators for strongly dependent variables. Nevertheless, the estimator
suggested in (Gao et al., 2015) relied on a heuristic for determining when to
use the correction term, and did not have any theoretical guarantees. In the
remainder of this paper, we suggest a novel estimator based on local Gaussian
approximation, as a more general approach to overcome the above limitations.
The main idea is that, instead of assuming a uniform distribution over the
local kNN rectangle or a PCA-aligned rectangle, we approximate the unknown
density at each sample point by a local Gaussian distribution, which is
estimated using the k-nearest neighborhood of that point. In addition to
demonstrating superior empirical performance of the proposed estimator, we also
show that it is asymptotically unbiased.

Figure 1: For a given sample point, we show the max-norm rectangle containing
k nearest neighbors (a) for points drawn from a uniform distribution, k = 3
(shaded area), and (b) for points drawn from a distribution over two strongly
correlated variables, k = 4 (the area within dotted lines).


4 Local Gaussian Density Estimation

In this section, we introduce a density estimation method called local Gaussian
density estimation, or LGDE (Hjort and Jones, 1996), which serves as the basic
building block for the proposed mutual information estimator.

Consider N i.i.d. samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$
drawn from an unknown density $f(\mathbf{x})$, where $\mathbf{x}$ is a
d-dimensional continuous random variable. The central idea behind LGDE is to
locally approximate the unknown probability density at point $\mathbf{x}$ using
a Gaussian parametric family
$\mathcal{N}_d(\mu(\mathbf{x}), \Sigma(\mathbf{x}))$, where $\mu(\mathbf{x})$
and $\Sigma(\mathbf{x})$ are the ($\mathbf{x}$-dependent) mean and covariance
matrix of each local approximation. This intuition is formalized in the
following definition:

Definition 3 (Local Gaussian Density Estimator) Let $\mathbf{x}$ denote a
d-dimensional absolutely continuous random variable with probability density
function $f(\mathbf{x})$, and let
$\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ be N i.i.d. samples
drawn from $f(\mathbf{x})$. Furthermore, let $K_H(\mathbf{x})$ be a product
kernel with diagonal bandwidth matrix
$H = \mathrm{diag}(h_1, h_2, \ldots, h_d)$, so that
$K_H(\mathbf{x}) = h_1^{-1} K(h_1^{-1} x_1)\, h_2^{-1} K(h_2^{-1} x_2) \cdots h_d^{-1} K(h_d^{-1} x_d)$,
where $K(\cdot)$ can be any one-dimensional kernel function. Then the Local
Gaussian Density Estimator, or LGDE, of $f(\mathbf{x})$ is given by

$$\hat{f}(\mathbf{x}) = \mathcal{N}_d\big(\mathbf{x}; \mu(\mathbf{x}), \Sigma(\mathbf{x})\big). \tag{6}$$

Here $\mu$ and $\Sigma$ are different for each point $\mathbf{x}$, and are
obtained by solving the following optimization problem,

$$\mu(\mathbf{x}), \Sigma(\mathbf{x}) = \arg\max_{\mu, \Sigma} \mathcal{L}(\mathbf{x}, \mu, \Sigma), \tag{7}$$

where $\mathcal{L}(\mathbf{x}, \mu, \Sigma)$ is the local likelihood function
defined as follows:

$$\mathcal{L}(\mathbf{x}, \mu, \Sigma) = \frac{1}{N} \sum_{i=1}^{N} K_H(\mathbf{x}_i - \mathbf{x}) \log \mathcal{N}_d(\mathbf{x}_i; \mu, \Sigma) - \int K_H(\mathbf{t} - \mathbf{x})\, \mathcal{N}_d(\mathbf{t}; \mu, \Sigma)\, d\mathbf{t} \tag{8}$$


The first term on the right hand side of Eq. 8 is the localized version of the
Gaussian log-likelihood. One can see that without the kernel function, Eq. 8
becomes similar to the global log-likelihood function of the Gaussian
parametric family. However, since we do not have sufficient information to
specify a global distribution, we make a local smoothness assumption by adding
this kernel function. The second term on the right hand side of Eq. 8 is a
penalty term to ensure the consistency of the density estimator.


The key difference between the kNN density estimator and LGDE is that the
former assumes that the density is locally uniform over the neighborhood of
each sample point, whereas the latter relaxes local uniformity to local
linearity*, which allows it to compensate for the boundary bias. In fact, any
non-uniform parametric probability distribution is suitable for fitting a local
distribution under the local likelihood, and the Gaussian distribution used
here is simply one realization.


Theorem 1 below establishes the consistency property of this local Gaussian
estimator; for a detailed proof see (Hjort and Jones, 1996).


Theorem 1 (Hjort and Jones, 1996) Let $\mathbf{x}$ denote a d-dimensional
absolutely continuous random variable with probability density function
$f(\mathbf{x})$, and let $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$
be N i.i.d. samples drawn from $f(\mathbf{x})$. Let $\hat{f}(\mathbf{x})$ be
the Local Gaussian Density Estimator with diagonal bandwidth matrix
$\mathrm{diag}(h_1, h_2, \ldots, h_d)$, where the diagonal elements $h_i$
satisfy the following conditions:

$$\lim_{N \to \infty} h_i = 0, \qquad \lim_{N \to \infty} N h_i = \infty, \qquad i = 1, 2, \ldots, d. \tag{9}$$

Then the following holds:

$$\lim_{N \to \infty} E\,|\hat{f}(\mathbf{x}) - f(\mathbf{x})| = 0 \tag{10}$$

$$\lim_{N \to \infty} E\,|\hat{f}(\mathbf{x}) - f(\mathbf{x})|^2 = 0 \tag{11}$$


*To elaborate on the local linearity, we note that the Gaussian distribution is
essentially a special case of the elliptical distribution
$f(\mathbf{x}) = k \cdot g\big((\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\big)$.
Therefore, the local Gaussian approximation actually assumes a rotated
hyper-ellipsoid locally at each point.




























The above theorem states that LGDE is asymptotically unbiased and
L2-consistent.
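To make the bandwidth conditions in Eq. 9 concrete, here is one family of bandwidths that satisfies them. The polynomial decay rate is an illustrative assumption, not a choice prescribed by the paper; the constants $c$ and $\alpha$ are hypothetical.

```latex
h_i = c\,N^{-\alpha}, \quad c > 0,\ 0 < \alpha < 1
\;\Longrightarrow\;
\lim_{N\to\infty} h_i = 0,
\qquad
\lim_{N\to\infty} N h_i = \lim_{N\to\infty} c\,N^{1-\alpha} = \infty .
```

In words, the neighborhood must shrink, but slowly enough that the number of samples falling inside it still grows.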

5 LGDE-based Estimators for Entropy and Mutual Information

We now introduce our estimators for entropy and mutual information that are
inspired by the local density estimation approach defined in the previous
section.

Let us again consider N i.i.d. samples
$(\mathcal{X}, \mathcal{Y}) = \{(\mathbf{x}, \mathbf{y})^{(i)}\}_{i=1}^{N}$
drawn from an unknown joint distribution $f_{XY}$, where $\mathbf{x}$ and
$\mathbf{y}$ are random vectors of dimensionality d and b, respectively. Let us
construct the following estimators for entropy,

$$\hat{H}(\mathbf{x}) = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{f}(\mathbf{x}_i), \tag{12}$$

and mutual information,

$$\hat{I}(\mathbf{x} : \mathbf{y}) = \frac{1}{N} \sum_{i=1}^{N} \log \frac{\hat{f}(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})}{\hat{f}(\mathbf{x}^{(i)})\, \hat{f}(\mathbf{y}^{(i)})}, \tag{13}$$

where $\hat{f}(\mathbf{x})$, $\hat{f}(\mathbf{y})$ and
$\hat{f}(\mathbf{x}, \mathbf{y})$ are the local Gaussian density estimators for
$f_X(\mathbf{x})$, $f_Y(\mathbf{y})$ and $f_{XY}(\mathbf{x}, \mathbf{y})$
respectively, defined in the previous section.

Recall that the entropy and mutual information can be written as appropriately
defined expectations; see Eqs. 4 and 5. The proposed estimators simply replace
the expectation by the sample average, and then plug the density estimators
from Section 4 into those expectations.

The next two theorems state that the proposed estimators are asymptotically
unbiased.

Theorem 2 (Asymptotic Unbiasedness of Entropy Estimator) If the conditions in
Eq. 9 hold, then the entropy estimator given by Eq. 12 is asymptotically
unbiased, i.e.,

$$\lim_{N \to \infty} E\,\hat{H}(\mathbf{x}) = H(\mathbf{x}). \tag{14}$$

Theorem 3 (Asymptotic Unbiasedness of MI Estimator) If the conditions in Eq. 9
hold, then the mutual information estimator given by Eq. 13 is asymptotically
unbiased:

$$\lim_{N \to \infty} E\,\hat{I}(\mathbf{x} : \mathbf{y}) = I(\mathbf{x} : \mathbf{y}). \tag{15}$$

We provide the proofs of the above theorems in the next section.


6 Proofs of the Theorems

Before getting to the actual proofs, we first introduce Lebesgue's dominated
convergence theorem.

Theorem 4 (Lebesgue dominated convergence theorem) Let $\{f_N\}$ be a sequence
of functions, and assume this sequence converges point-wise to a function $f$,
i.e., $f_N(\mathbf{x}) \to f(\mathbf{x})$ for any $\mathbf{x} \in \mathcal{X}$.
Furthermore, let us assume that $f_N$ is dominated by an integrable function
$g$, i.e., for any $\mathbf{x} \in \mathcal{X}$ we have

$$|f_N(\mathbf{x})| < g(\mathbf{x}).$$

Then we have

$$\lim_{N \to \infty} \int_{\mathbf{x} \in \mathcal{X}} |f_N(\mathbf{x}) - f(\mathbf{x})|\, d\mathbf{x} = 0.$$

6.1 Proof of Theorem 2

Consider N i.i.d. samples $\{\mathbf{x}_i\}_{i=1}^{N}$ drawn from the
probability density $f(\mathbf{x})$, and let $F_N(\mathbf{x})$ denote the
empirical cumulative distribution function.

Let us define the following two quantities:

$$H_1 = -\frac{1}{N} \sum_{i=1}^{N} \ln E\hat{f}(\mathbf{x}_i) \tag{16}$$

$$H_2 = -\frac{1}{N} \sum_{i=1}^{N} \ln f(\mathbf{x}_i) \tag{17}$$

Then we have

$$
\begin{aligned}
E\,|\hat{H}(\mathbf{x}) - H(\mathbf{x})|
&= E\,|(\hat{H} - H_1) + (H_1 - H_2) + (H_2 - H)| \\
&\le E\,|\hat{H} - H_1| + E\,|H_1 - H_2| + E\,|H_2 - H|
\end{aligned}
\tag{18}
$$

We now proceed to show that each of the terms in Eq. 18 individually converges
to 0 in the limit $N \to \infty$, which will then yield Eq. 14. First, we note
that according to the mean value theorem, for any $\mathbf{x}$, there exist
$t_{\mathbf{x}}$ and $t'_{\mathbf{x}}$ in (0, 1), such that

$$\ln \hat{f}(\mathbf{x}) = \ln E\hat{f}(\mathbf{x}) + \big(\hat{f}(\mathbf{x}) - E\hat{f}(\mathbf{x})\big) \ln'\!\big(t_{\mathbf{x}} \hat{f}(\mathbf{x}) + (1 - t_{\mathbf{x}})\, E\hat{f}(\mathbf{x})\big) \tag{19}$$

and

$$\ln E\hat{f}(\mathbf{x}) = \ln f(\mathbf{x}) + \big(E\hat{f}(\mathbf{x}) - f(\mathbf{x})\big) \ln'\!\big(t'_{\mathbf{x}} f(\mathbf{x}) + (1 - t'_{\mathbf{x}})\, E\hat{f}(\mathbf{x})\big) \tag{20}$$


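For the first term in Eq. 18, we use Eq. 19 to obtain

$$
\begin{aligned}
E\,|\hat{H} - H_1|
&= E \left| \int \big[\ln \hat{f}(\mathbf{x}) - \ln E\hat{f}(\mathbf{x})\big]\, dF_N(\mathbf{x}) \right| \\
&= E \left| \int \frac{\hat{f}(\mathbf{x}) - E\hat{f}(\mathbf{x})}{t_{\mathbf{x}} \hat{f}(\mathbf{x}) + (1 - t_{\mathbf{x}})\, E\hat{f}(\mathbf{x})}\, dF_N(\mathbf{x}) \right| \\
&\le \frac{1}{1 - t}\, E \int \frac{|\hat{f}(\mathbf{x}) - E\hat{f}(\mathbf{x})|}{E\hat{f}(\mathbf{x})}\, dF_N(\mathbf{x}) \\
&= \frac{1}{1 - t}\, E \left( \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{f}(\mathbf{x}_i) - E\hat{f}(\mathbf{x}_i)|}{E\hat{f}(\mathbf{x}_i)} \right) \\
&= \frac{1}{1 - t}\, \frac{1}{N} \sum_{i=1}^{N} E \left[ E \left[ \frac{|\hat{f}(\mathbf{u}) - E\hat{f}(\mathbf{u})|}{E\hat{f}(\mathbf{u})} \,\Big|\, \mathbf{x}_i = \mathbf{u} \right] \right] \\
&= \frac{1}{1 - t} \int \frac{|\hat{f}(\mathbf{u}) - E\hat{f}(\mathbf{u})|}{E\hat{f}(\mathbf{u})}\, f(\mathbf{u})\, d\mathbf{u}
\end{aligned}
\tag{21}
$$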


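where $t$ is the maximum value among all $t_{\mathbf{x}}$. Using Theorem 1, we
have $|\hat{f}(\mathbf{u}) - E\hat{f}(\mathbf{u})| \to 0$ as $N \to \infty$.
Furthermore, it is possible to show that there exists $N_0$ such that for any
$N > N_0$ one has
$|\hat{f}(\mathbf{u}) - E\hat{f}(\mathbf{u})| \cdot \frac{f(\mathbf{u})}{E\hat{f}(\mathbf{u})} < 2 f(\mathbf{u})$.
Thus, using Theorem 4, we obtain

$$\lim_{N \to \infty} E\,|\hat{H} - H_1| = 0 \tag{22}$$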

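Similarly, using Eq. 20, $E\,|H_1 - H_2|$ can be written as

$$
\begin{aligned}
E\,|H_1 - H_2|
&= E \left| \int \big[\ln E\hat{f}(\mathbf{x}) - \ln f(\mathbf{x})\big]\, dF_N(\mathbf{x}) \right| \\
&= E \left| \int \frac{E\hat{f}(\mathbf{x}) - f(\mathbf{x})}{t'_{\mathbf{x}} f(\mathbf{x}) + (1 - t'_{\mathbf{x}})\, E\hat{f}(\mathbf{x})}\, dF_N(\mathbf{x}) \right| \\
&\le \frac{1}{t'}\, E \int \frac{|E\hat{f}(\mathbf{x}) - f(\mathbf{x})|}{f(\mathbf{x})}\, dF_N(\mathbf{x}) \\
&= \frac{1}{t'}\, E \left( \frac{1}{N} \sum_{i=1}^{N} \frac{|E\hat{f}(\mathbf{x}_i) - f(\mathbf{x}_i)|}{f(\mathbf{x}_i)} \right) \\
&= \frac{1}{t'} \int f(\mathbf{x})\, \frac{|E\hat{f}(\mathbf{x}) - f(\mathbf{x})|}{f(\mathbf{x})}\, d\mathbf{x} \\
&= \frac{1}{t'} \int |E\hat{f}(\mathbf{x}) - f(\mathbf{x})|\, d\mathbf{x}
\end{aligned}
\tag{23}
$$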


where $t'$ is the minimum value among all $t'_{\mathbf{x}}$.

Invoking Theorem 1 again, we observe that the integrand in the last term of
Eq. 23, $|E\hat{f}(\mathbf{x}) - f(\mathbf{x})|$, converges to 0 as
$N \to \infty$, and is bounded by $2 f(\mathbf{x})$ for sufficiently large N
(e.g., when $f(\mathbf{x})$ and $E\hat{f}(\mathbf{x})$ are sufficiently close).
Therefore, by Theorem 4, we have

$$\lim_{N \to \infty} E\,|H_1 - H_2| = 0 \tag{24}$$


Finally, for the last term in Eq. 18, we note that

$$E H_2 = -\frac{1}{N}\, E \sum_{i=1}^{N} \ln f(\mathbf{x}_i) = E\left[-\ln f(\mathbf{x})\right] \tag{25}$$

Thus, $E H_2$ is simply the entropy in Definition 1; see Eq. 4. Therefore,

$$\lim_{N \to \infty} E\,|H_2 - H| = 0 \tag{26}$$

Combining Eqs. 18, 22, 24 and 26, we arrive at Eq. 14, which concludes the
proof.

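6.2 Proof of Theorem 3

For mutual information estimation, we use Eq. 3 to get

$$
\begin{aligned}
E\,|\hat{I}(\mathbf{x} : \mathbf{y}) - I(\mathbf{x} : \mathbf{y})|
\le{}& E\,|\hat{H}(\mathbf{x}) - H(\mathbf{x})| \\
&+ E\,|\hat{H}(\mathbf{y}) - H(\mathbf{y})| \\
&+ E\,|\hat{H}(\mathbf{x}, \mathbf{y}) - H(\mathbf{x}, \mathbf{y})|
\end{aligned}
\tag{27}
$$

Using Theorem 2, we see that all three terms on the right hand side of Eq. 27
converge to zero as $N \to \infty$. Therefore
$\lim_{N \to \infty} E\,|\hat{I}(\mathbf{x} : \mathbf{y}) - I(\mathbf{x} : \mathbf{y})| = 0$,
thus concluding the proof.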

7 Experiments 

7.1 Implementation Details 

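Our main computational task is to maximize the local likelihood function in
Eq. 8. Since computing the second term on the right hand side of Eq. 8 requires
integration that can be time-consuming, we choose the kernel function
$K(\cdot)$ to be a Gaussian kernel,
$K_H(\mathbf{t} - \mathbf{x}) = \mathcal{N}_d(\mathbf{t}; \mathbf{x}, H)$, so
that the integral can be performed analytically, yielding

$$\int K_H(\mathbf{t} - \mathbf{x})\, \mathcal{N}_d(\mathbf{t}; \mu, \Sigma)\, d\mathbf{t} = \mathcal{N}_d(\mathbf{x}; \mu, H + \Sigma) \tag{28}$$

Thus, Eq. 8 reduces to

$$\mathcal{L}(\mathbf{x}, \mu, \Sigma) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}_d(\mathbf{x}_i; \mathbf{x}, H) \log \mathcal{N}_d(\mathbf{x}_i; \mu, \Sigma) - \mathcal{N}_d(\mathbf{x}; \mu, H + \Sigma) \tag{29}$$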
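The closed form in Eq. 28 is the standard result that the integral of a product of two Gaussian densities is again a Gaussian density whose covariances add. The sketch below checks this numerically in one dimension; the parameter values, the integration window and the helper names are illustrative assumptions, and the kernel is parameterized by its variance `h2`.

```python
import numpy as np


def normal_pdf(t, mean, var):
    """Density of the one-dimensional Gaussian N(mean, var) at t."""
    return np.exp(-0.5 * (t - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def penalty_by_quadrature(x, mu, sigma2, h2, half_width=40.0, n=400_001):
    """Left-hand side of Eq. 28 (d = 1), approximated by a Riemann sum."""
    t = np.linspace(x - half_width, x + half_width, n)
    integrand = normal_pdf(t, x, h2) * normal_pdf(t, mu, sigma2)
    return np.sum(integrand) * (t[1] - t[0])


x, mu, sigma2, h2 = 0.3, -0.5, 2.0, 0.25
numeric = penalty_by_quadrature(x, mu, sigma2, h2)
closed = normal_pdf(x, mu, h2 + sigma2)  # right-hand side of Eq. 28
print(numeric, closed)
```

Replacing the integral by a single density evaluation is what makes each evaluation of Eq. 29 cheap inside an iterative optimizer.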

Maximizing Eq. 29 is a constrained non-convex optimization problem with the
condition that the covariance matrix $\Sigma$ is positive semi-definite. We use
a Cholesky parameterization to enforce the positive semi-definiteness of
$\Sigma$, which allows us to reduce the constrained optimization problem to an
unconstrained one. Also, since we would like to preserve the local structure of
the data, we select the bandwidth to be close to the distance between pairs of
k-nearest points (averaged over all the points).
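One possible reading of this bandwidth rule is the distance from each point to its k-th nearest neighbor, averaged over all points. The sketch below implements that reading with a brute-force pairwise distance matrix; the use of the Euclidean norm and the function name `mean_knn_distance` are assumptions, since the text does not fix either.

```python
import numpy as np


def mean_knn_distance(points, k):
    """Average Euclidean distance from each point to its k-th nearest neighbor."""
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]  # treat a flat array as N one-dimensional points
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt(np.sum(diffs**2, axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(dists, axis=1)[:, k]
    return kth.mean()


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
print(mean_knn_distance(grid, 1), mean_knn_distance(grid, 2))
```

The pairwise matrix costs O(N^2) memory, which is fine for a few thousand points; a spatial index would be the natural replacement for larger samples. Whether the resulting scalar enters H as a scale or as a variance depends on how the bandwidth matrix is parameterized, which the text leaves open.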



































We use the Newton-Raphson method to do the maximization, although the function
itself is not exactly concave. The full algorithm for our estimator is given in
Algorithm 1, which takes Algorithm 2 as a subroutine. Note that in Algorithm 2,
the Wolfe condition is a set of inequalities used in performing quasi-Newton
methods (Wolfe, 1969).


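Algorithm 1 Mutual Information Estimation with Local Gaussian Approximation
Input: points $(\mathbf{x}, \mathbf{y})^{(1)}, (\mathbf{x}, \mathbf{y})^{(2)}, \ldots, (\mathbf{x}, \mathbf{y})^{(N)}$
Output: $\hat{I}(\mathbf{x} : \mathbf{y})$
    Calculate entropy $\hat{H}(\mathbf{x})$ using samples $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}$
    Calculate entropy $\hat{H}(\mathbf{y})$ using samples $\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(N)}$
    Calculate joint entropy $\hat{H}(\mathbf{x}, \mathbf{y})$ using input samples
    Return estimated mutual information $\hat{I} = \hat{H}(\mathbf{x}) + \hat{H}(\mathbf{y}) - \hat{H}(\mathbf{x}, \mathbf{y})$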


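Algorithm 2 Entropy Estimation with Local Gaussian Approximation
Input: points $\mathbf{u}^{(1)}, \mathbf{u}^{(2)}, \ldots, \mathbf{u}^{(N)}$
Output: $\hat{H}(\mathbf{u})$
    Initialize $\hat{H}(\mathbf{u}) = 0$
    for each point $\mathbf{u}^{(i)}$ do
        Initialize $\mu = \mu_0$, $L = L_0$
        while not ($\mu$, $\Sigma = L L^T$) converge do
            Calculate $\mathcal{L}(\mathbf{u}^{(i)}, \mu, \Sigma = L L^T)$
            Calculate gradient vector G of $\mathcal{L}(\mathbf{u}^{(i)}, \mu, \Sigma = L L^T)$ with respect to $\mu$, $L$
            Calculate Hessian matrix H of $\mathcal{L}(\mathbf{u}^{(i)}, \mu, \Sigma = L L^T)$ with respect to $\mu$, $L$
            Do Hessian modification to ensure the positive semi-definiteness of H
            Calculate descent direction $D = -\alpha H^{-1} G$, where we compute $\alpha$ to satisfy the Wolfe condition
            Update $\mu$, $L$ with $(\mu, L) + D$
        end while
        $\hat{f}(\mathbf{u}^{(i)}) = \mathcal{N}(\mathbf{u}^{(i)}; \mu, \Sigma = L L^T)$
        $\hat{H}(\mathbf{u}) = \hat{H}(\mathbf{u}) - \log \hat{f}(\mathbf{u}^{(i)}) / N$
    end for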
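Algorithms 1 and 2 can be sketched in Python as follows. This is a minimal illustration under several labeled assumptions rather than the authors' implementation: SciPy's BFGS routine stands in for the Newton-Raphson iteration with Hessian modification, the optimizer is initialized at the global sample mean and covariance (the text does not specify $\mu_0$ and $L_0$), H is treated as the covariance of the Gaussian kernel as in Eq. 29, a small diagonal jitter is added to $\Sigma$ for numerical safety, and the sum runs over all points rather than a local neighborhood. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def gauss_logpdf(pts, mean, cov):
    """Log of N_d(pts; mean, cov), evaluated at each row of pts."""
    d = mean.shape[0]
    diff = pts - mean
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (d * np.log(2.0 * np.pi) + np.linalg.slogdet(cov)[1] + maha)


def unpack(theta, d):
    """Map an unconstrained vector to (mu, Sigma = L L^T) via a Cholesky factor."""
    mu = theta[:d]
    chol = np.zeros((d, d))
    chol[np.tril_indices(d)] = theta[d:]
    return mu, chol @ chol.T + 1e-10 * np.eye(d)  # jitter: added safeguard


def neg_local_likelihood(theta, x, samples, weights, H):
    """Negative of Eq. 29 at query point x; weights are N_d(x_i; x, H)."""
    mu, cov = unpack(theta, x.shape[0])
    fit = np.mean(weights * gauss_logpdf(samples, mu, cov))
    penalty = np.exp(gauss_logpdf(x[None, :], mu, H + cov)[0])
    return -(fit - penalty)


def lgde_log_density(x, samples, H):
    """Log of the local Gaussian density estimate (Eq. 6) at x."""
    d = x.shape[0]
    weights = np.exp(gauss_logpdf(samples, x, H))
    # Assumed initialization: the global Gaussian fit to all samples.
    mu0 = samples.mean(axis=0)
    chol0 = np.linalg.cholesky(np.atleast_2d(np.cov(samples, rowvar=False)))
    theta0 = np.concatenate([mu0, chol0[np.tril_indices(d)]])
    res = minimize(neg_local_likelihood, theta0,
                   args=(x, samples, weights, H), method="BFGS")
    mu, cov = unpack(res.x, d)
    return gauss_logpdf(x[None, :], mu, cov)[0]


def lgde_entropy(samples, H):
    """Entropy estimate of Eq. 12: minus the mean log-density at the samples."""
    return -np.mean([lgde_log_density(x, samples, H) for x in samples])


def lgde_mutual_information(xs, ys, H_x, H_y, H_xy):
    """MI estimate as in Algorithm 1: H(x) + H(y) - H(x, y)."""
    joint = np.hstack([xs, ys])
    return lgde_entropy(xs, H_x) + lgde_entropy(ys, H_y) - lgde_entropy(joint, H_xy)
```

A call such as `lgde_entropy(data, np.eye(1))` with `data` of shape `(N, 1)` returns an entropy estimate in nats. Each query point triggers its own optimization, so the cost grows quickly with N; restricting the sum to nearby points, as described in the text, is the natural speed-up.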


In a single step, evaluating the gradient and Hessian in Algorithm 2 would take
O(N) time because Eq. 29 is a summation over all the points. However, for
points that are far from the current point, the kernel weight is very close to
zero, so we can ignore those points and do the summation only over a local
neighborhood of the current point.

7.2 Experiments with synthetic data

Functional relationships We test our MI estimator for near-functional
relationships of the form $Y = f(X) + U(0, \theta)$, where $U(0, \theta)$ is
the uniform distribution over the interval $(0, \theta)$, and X is drawn
uniformly at random from [0, 1]. Similar relationships were studied in (Reshef
et al., 2011), (Kinney and Atwal, 2014) and (Gao et al., 2015).

We compare our estimator to several baselines that include the kNN estimator
proposed by Kraskov et al. (2004), an estimator based on generalized
nearest-neighbor graphs (GNN) (Pal et al., 2010), and the minimum spanning tree
method (MST) (Yukich and Yukich, 1998). We evaluate those estimators for six
different functional relationships as indicated in Figure 2. We use N = 2500
sample points for each relationship. To speed up the optimization, we limited
the summation in Eq. 29 to only the k nearest neighbors, thus reducing the
computational complexity from O(N) to O(k) in every iteration step of
Algorithm 2.

[Figure 2 plots omitted; legible panel titles include Y = X + U.]

Figure 2: Functional relationship test for mutual information estimators. The
horizontal axis is the value of $\theta$, which controls the noise level; the
vertical axis is the mutual information in nats. For the Kraskov and GNN
estimators we used nearest neighbor parameter k = 5. For the local Gaussian
estimator, we choose the bandwidth to be the distance between a point and its
5th nearest neighbor.
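For the simplest near-functional relationship, $Y = X + U(0, \theta)$ with $X$ uniform on [0, 1], the ground-truth value can be worked out by hand, which shows how quickly the true MI grows as the noise shrinks. Since $I(X : Y) = H(Y) - H(Y \mid X)$ and $H(Y \mid X) = \log\theta$, and the density of $Y$ is a trapezoid with entropy $\theta/2$ when $0 < \theta \le 1$, one gets $I(X : Y) = \theta/2 - \log\theta$. The sketch below checks this derivation numerically; it is an illustrative calculation added here, not a result quoted from the paper, and the function names are hypothetical.

```python
import numpy as np


def true_mi_linear(theta):
    """Closed-form MI (nats) for Y = X + U(0, theta), X ~ U[0, 1], 0 < theta <= 1."""
    return 0.5 * theta - np.log(theta)


def numeric_mi_linear(theta, n=200_000):
    """Same quantity, with H(Y) obtained by midpoint-rule integration."""
    dy = (1.0 + theta) / n
    y = (np.arange(n) + 0.5) * dy  # midpoints of (0, 1 + theta)
    # Density of Y: overlap of [y - theta, y] with [0, 1], divided by theta.
    f_y = (np.minimum(1.0, y) - np.maximum(0.0, y - theta)) / theta
    h_y = -np.sum(f_y * np.log(f_y)) * dy
    return h_y - np.log(theta)  # H(Y | X) = H(U) = log(theta)


for theta in (1.0, 0.1, 0.001):
    print(theta, true_mi_linear(theta), numeric_mi_linear(theta))
```

As $\theta$ decreases by an order of magnitude, the true MI increases by roughly $\log 10 \approx 2.3$ nats, so low-noise settings probe exactly the regime where sample requirements of locally uniform estimators grow exponentially.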



































































































One can see from Figure 2 that when $\theta$ is relatively large, all methods
except MST produce accurate estimates of MI. However, as one decreases
$\theta$, all three baseline estimators start to significantly underestimate
mutual information. In this low-noise regime, our proposed estimator
outperforms the baselines, at times by a significant margin. Note also that all
the estimators, including ours, perform relatively poorly for highly non-linear
relationships (the last row in Figure 2). According to our intuition, this
happens when the scale of the non-linearity becomes sufficiently small, so that
the linear approximation of the relationship around the local neighborhood of
each sample point does not hold. Under this scenario, accuracy can be recovered
by adding more samples.


8 Related Work

Mutual Information Estimators Recently, there has been a significant amount of
work on estimating information-theoretic quantities such as entropy, mutual
information, and divergences from i.i.d. samples. Methods include
k-nearest-neighbors (Singh et al., 2003), (Kraskov et al., 2004); minimum
spanning trees (Yukich and Yukich, 1998); kernel density estimates (Moon et
al., 1995), (Singh and Poczos, 2014); maximum likelihood density ratio (Suzuki
et al., 2008); and ensemble methods (Moon and Hero, 2014), (Sricharan et al.).
As discussed in Section 3, kNN-based estimators tend to underestimate the
mutual information when two variables have strong dependency. Gao et al. (2015)
addressed this shortcoming by introducing a local non-uniformity correction,
but their estimator depended on a heuristically defined threshold parameter and
lacked performance guarantees.

Density Estimation and Boundary Bias Density estimation is a classic problem in
statistics and machine learning. Kernel density estimation and
k-nearest-neighbor density estimates are the two most popular and successful
non-parametric methods. However, it has been recognized that these
non-parametric techniques often suffer from the problem of so-called "boundary
bias". Researchers have proposed a variety of methods to overcome the bias,
such as the reflection method (Schuster, 1985), (Silverman, 1986); the boundary
kernel method (Zhang and Karunamuni, 2000); the transformation method (Marron
and Ruppert, 1994); the pseudo-data method (Cowling and Hall, 1996); and
others. All these methods are useful in some particular settings. But when it
comes to mutual information estimation, how can we choose the most efficient
one to use? It seems that the local likelihood method (Hjort and Jones, 1996),
(Loader, 1996) is a good choice for estimating the mutual information due to
its ability to detect the boundary without any prior knowledge. Previous
studies have already proven the power of local regression, which can
automatically overcome the boundary bias. Methods based on local likelihood
estimation have traditionally attracted less attention due to their
computational complexity. However, advances in computational power allow us to
reconsider this class of methods.

9 Conclusion and Future Work

Past research on mutual information estimation has mostly focused on
distinguishing weak dependence from independence. However, in the era of big
data, we are often interested in highlighting the strongest dependencies among
a large number of variables. When those variables are highly inter-dependent,
traditional non-parametric mutual information estimators fail to accurately
estimate the value due to the boundary bias.

We have addressed this shortcoming by introducing a novel semi-parametric
method for estimating entropy and mutual information based on local Gaussian
approximation of the unknown density at the sample points. We demonstrated that
the proposed estimators are asymptotically unbiased. We also showed empirically
that the proposed estimator has superior performance compared to a number of
popular baseline methods, and can accurately measure the strength of the
relationship even for strongly dependent variables and a limited number of
samples.

There are several potential avenues for future work. First of all, we would
like to validate the proposed estimator in higher-dimensional settings. In
principle, the approach is general and can be applied in any dimension.
However, the optimization procedure may be computationally expensive in higher
dimensions, since the number of parameters scales as $O(d^2)$ with
dimensionality d. An intuitive solution would be to initialize the parameters
with the results obtained from nearby points, which can facilitate
convergence.

Another interesting issue is the bandwidth selection, which is an important
problem in general density estimation. If the bandwidth is too large, the local
Gaussian assumption may not be valid, whereas a very small bandwidth will
result in non-smooth densities. Ideally, we would like to choose the bandwidth
in a way that preserves the local Gaussian structure in the neighborhood of
each point. Another interesting extension would be choosing the bandwidth
adaptively for each point.

Finally, while here we have focused on the asymptotic unbiasedness of the
proposed estimators, it will be very valuable to establish theoretical results
about the convergence rates of the estimators, as well as their variance in the
large sample limit.


Note added in proof: We have become aware of a very recent paper on
non-parametric entropy estimation (Lombardi and Pant, 2015) that is also based
on local Gaussian approximation. Specifically, the kpN estimator suggested in
(Lombardi and Pant, 2015) fits a Gaussian distribution with the empirical mean
and covariance matrix of the p-nearest neighbors of each point, and then uses
this distribution to approximate the probability mass contained in the kNN ball
centered at that point. Despite obvious similarities, we note that our approach
is based on a local minimization of the Kullback-Leibler distance between the
true and the approximating Gaussian densities, whereas the kpN estimator works
by fitting a truncated Gaussian distribution. As a result, we are able to
derive formal performance guarantees, thus making our approach theoretically
better grounded.